feat(eval): configurable trials, parallelism, and CI threshold support #122

Merged
aroff merged 1 commit into main from feature/049-eval-trials-parallelism on Apr 20, 2026

@aroff aroff commented Apr 20, 2026

Summary

  • Adds trials_per_case, parallel, and pass_threshold fields to EvalConfigToml / EvalConfig with backward-compatible defaults (trials=1, threshold=1.0)
  • Introduces TrialResult and CaseTrialsResult types; trial artifacts written to {run_dir}/{case_id}/trial-N/ with aggregated.json per case
  • Extends EvalRunner trait with run_case_trials() backed by JoinSet + Semaphore bounded concurrency
  • Adds --trials, --ci, and --threshold CLI flags to eval run; --ci gates exit code on suite pass rate vs threshold
  • Emits cost warning when trials × cases >= 100
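
The defaults, artifact layout, and cost warning listed above can be sketched in a few lines. This is a minimal illustration, not the PR's code: the struct below is a hypothetical stand-in for EvalConfig that models only the three new fields, the helper names (trial_dir, aggregated_path, should_warn_about_cost) are invented for the example, and the default of parallel=1 is an assumption, since the summary only states defaults for trials and threshold.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical stand-in for the PR's EvalConfig; only the new fields are modeled.
#[derive(Debug, Clone, PartialEq)]
struct EvalConfig {
    trials_per_case: u32,
    parallel: usize,
    pass_threshold: f64,
}

impl Default for EvalConfig {
    fn default() -> Self {
        // trials=1 and threshold=1.0 are the defaults named in the summary.
        // parallel=1 is an assumption made for this sketch.
        Self { trials_per_case: 1, parallel: 1, pass_threshold: 1.0 }
    }
}

/// Builds {run_dir}/{case_id}/trial-N/. Whether N starts at 0 or 1 is not
/// stated in the summary, so the caller picks the index.
fn trial_dir(run_dir: &Path, case_id: &str, n: u32) -> PathBuf {
    run_dir.join(case_id).join(format!("trial-{n}"))
}

/// Builds the per-case summary path {run_dir}/{case_id}/aggregated.json.
fn aggregated_path(run_dir: &Path, case_id: &str) -> PathBuf {
    run_dir.join(case_id).join("aggregated.json")
}

/// True when trials x cases reaches the 100-run mark from the summary.
fn should_warn_about_cost(trials_per_case: u32, case_count: usize) -> bool {
    (trials_per_case as usize).saturating_mul(case_count) >= 100
}

fn main() {
    let cfg = EvalConfig::default();
    let run_dir = Path::new("runs").join("example"); // hypothetical run directory
    println!("{cfg:?}");
    println!("{}", trial_dir(&run_dir, "case-a", 1).display());
    println!("{}", aggregated_path(&run_dir, "case-a").display());
    if should_warn_about_cost(5, 20) {
        eprintln!("warning: 100 or more eval invocations requested");
    }
}
```

Keeping the path helpers separate from the runner means the directory layout can be checked without launching any trials.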

Test plan

  • All 150 unit tests and 48 eval integration tests pass (cargo nextest run -E 'test(eval)')
  • test_eval_run_trials_threshold_and_ci_exit_semantics — verifies 3/5 pass at threshold=0.6 succeeds; 3/5 at threshold=1.0 fails
  • test_eval_run_parallelism_reduces_wall_time — 4 trials × 0.5s sleep complete in <1.6s with parallel=4
  • Snapshot eval_run_help updated to include --trials, --ci, --threshold
  • Existing single-trial projects continue working unchanged (backward compatible defaults)
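
The threshold semantics exercised by the test plan (3/5 passing at threshold=0.6 succeeds, 3/5 at threshold=1.0 fails) are consistent with a "pass rate >= threshold" comparison. The sketch below works through that arithmetic. The function names, the exit code values 0 and 1, the choice to ignore the threshold when --ci is absent, and the treatment of an empty trial list are all assumptions for illustration; how the real implementation rolls per-case results into a suite pass rate is not described in this PR text.

```rust
/// Fraction of passing trials. Returns 0.0 for an empty slice
/// (an assumption; the PR does not say how zero trials are handled).
fn pass_rate(outcomes: &[bool]) -> f64 {
    if outcomes.is_empty() {
        return 0.0;
    }
    let passed = outcomes.iter().filter(|&&ok| ok).count();
    passed as f64 / outcomes.len() as f64
}

/// Hypothetical reading of --ci: exit 1 only when CI gating is on and the
/// pass rate falls below the threshold; otherwise exit 0.
fn ci_exit_code(rate: f64, threshold: f64, ci: bool) -> i32 {
    if ci && rate < threshold {
        1
    } else {
        0
    }
}

fn main() {
    // Three passes out of five trials, as in the test plan.
    let outcomes = [true, true, true, false, false];
    let rate = pass_rate(&outcomes);

    for threshold in [0.6, 1.0] {
        let code = ci_exit_code(rate, threshold, true);
        println!("rate={rate} threshold={threshold} exit={code}");
    }
}
```

Because the comparison is inclusive, a run that lands exactly on the threshold passes, which matches the 3/5 at 0.6 case.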

…pport

Extends the eval system to run multiple trials per case with bounded
concurrency and deterministic pass-rate aggregation. Adds --trials,
--ci, and --threshold CLI flags; trial artifacts are written under
{run_dir}/{case_id}/trial-N/ with aggregated.json summaries. Existing
single-trial configs continue working without change.
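
Bounded concurrency with deterministic aggregation can be illustrated without any dependencies. The PR itself uses JoinSet + Semaphore (presumably tokio's); the sketch below swaps in std scoped threads and a shared counter so it runs with the standard library alone. The function name run_trials_bounded and the bool-per-trial result type are hypothetical. Results are stored by trial index, so the aggregated order does not depend on which trial finishes first.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;
use std::thread;
use std::time::{Duration, Instant};

/// Runs `trials` calls of `run_trial` with at most `parallel` in flight.
/// Each result is written to the slot for its trial index.
fn run_trials_bounded<F>(trials: usize, parallel: usize, run_trial: F) -> Vec<bool>
where
    F: Fn(usize) -> bool + Sync,
{
    let next = AtomicUsize::new(0); // next trial index to claim
    let results = Mutex::new(vec![false; trials]);
    // Never spawn more workers than trials, and always at least one.
    let workers = parallel.max(1).min(trials.max(1));

    thread::scope(|s| {
        for _ in 0..workers {
            s.spawn(|| loop {
                // Each worker claims the next unclaimed index until none remain.
                let i = next.fetch_add(1, Ordering::Relaxed);
                if i >= trials {
                    break;
                }
                let ok = run_trial(i);
                results.lock().unwrap()[i] = ok;
            });
        }
    });

    results.into_inner().unwrap()
}

fn main() {
    let start = Instant::now();
    // Four simulated trials that each sleep briefly; the last one "fails".
    let outcomes = run_trials_bounded(4, 4, |i| {
        thread::sleep(Duration::from_millis(50));
        i != 3
    });
    println!("outcomes={outcomes:?} elapsed={:?}", start.elapsed());
}
```

A fixed pool of workers pulling indices from a counter caps the number of concurrent trials at `parallel`, which is the same bound a semaphore with `parallel` permits would enforce in an async runner.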
@aroff aroff merged commit 88d1c2b into main Apr 20, 2026
11 checks passed
@aroff aroff deleted the feature/049-eval-trials-parallelism branch April 20, 2026 17:53